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Abstract 

Current sequencing teclinology enables generation of whole genome sequencing data sets that contain a high 
density of rare variants, each of which is carried by, at most, 5% of the sampled subjects. Such variants are 
involved in the etiology of most common diseases in humans. These diseases can be studied by relevant 
longitudinal phenotype traits. Tests for association between such genotype information and longitudinal traits 
allow the study of the function of rare variants in complex human disorders. In this paper, we propose an 
association-screening framework that highlights the genotypic differences observed on rare variants and the 
longitudinal nature of phenotypes. In particular, both variants within a gene and longitudinal phenotypes are 
used to create partitions of subjects. Association between the 2 sets of constructed partitions is then evaluated. 
We apply the proposed strategy to the simulated data from the Genetic Analysis Workshop 18 and compare the 
obtained results with those from sequence kernel association test using the receiver operating characteristic 
curves. 



Background 

Rare variants have been speculated to be involved in the 
etiology of complex human diseases [1]. Such diseases 
usually progress over time so that measures of relevant 
traits at different time points can provide information 
on the disease development process. For example, the 
Type 2 Diabetes Genetic Exploration by Next-genera- 
tion sequencing in Ethnic Samples (T2D-GENES) Pro- 
ject 2 aims to identify rare variants influencing 
susceptibility to type 2 diabetes using information from 
whole genome sequencing (WGS) and measurements of 
related traits (such as blood pressure) at up to 4 time 
points. Such WGS genotype and longitudinal phenotype 
data present new challenges to commonly used statisti- 
cal methods for association testing in genome-wide 
studies. 
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Many genetic variants are rare variants (here we are 
referring to rare variants with minor allele frequencies 
[MAFs] <5%). Because of their low MAFs, traditional 
association methods may suffer from low power. A nat- 
ural idea for improving power is grouping or collapsing 
together certain variants. Such collapsing methods are 
based on the assumption that rare variants in a group 
(eg, gene or pathway) may function in combination [2]. 
For example, the sequence kernel association test 
(SKAT) [3] assigns different weights to variants in a 
region and incorporates them into a kernel matrix. We 
have proposed an inverse-probability weighted clustering 
approach [4], a gene-based method where inverse-prob- 
ability weighting is used to overweigh genotypic differ- 
ences observed on rare variants. The above methods can 
deal with both continuous and dichotomous traits and 
have obtained insightful results in different studies. How- 
ever, leveraging them in an effort to efficiently address 
longitudinal traits remains a major obstacle. 
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Longitudinal traits (ie, time-series phenotypes) provide 
valuable information regarding the progression of dis- 
eases. Traditionally, such longitudinal data can be ana- 
lyzed using the so-called cross-sectional strategies. In 
particular, such methods involve repeating the same ana- 
lysis at various, specific points in time. Since at each time 
point the trait under consideration reduces to a scalar, 
methods such as inverse-probability clustering can be 
conducted for association screening. Then, variants can 
be selected based on the results from each time point. 
The assumption underlying this type of strategy is that 
genetic variants maintain similar influences at different 
time points. However, it is more likely that those variants 
influence the pattern of the traits across time; for exam- 
ple, a group of variants may affect how blood pressure 
changes in a time-dependent manner. Cross-sectional 
analysis may fail under such circumstances. A method 
that takes full consideration of the longitudinal nature of 
traits is thus desired to capture such genetics-time 
interactions. 

In this paper, we propose a dual-clustering framework, 
which highlights both rare variants and the longitudinal 
structure of traits. By "dual" clustering, we mean indivi- 
duals are clustered based on both genotypic information 
through inverse-probability weighting and longitudinal 
traits through ordinary hierarchical clustering. Associa- 
tion between the 2 sets of partition labels can then be 
readily evaluated using existing single-marker and sca- 
lar-trait association methods, such as one-way analysis 
of variance (ANOVA) or the partition retention (PR) 
method [5,6]. We apply the proposed approach to the 
simulated data of the Genetic Analysis Workshop 18 
(GAW18) and compare the obtained results with those 
obtained by SKAT. The comparison produces some 
interesting findings. 

Methods 

Data set 

The simulated data set of GAW18 is a combination of 
real WGS data and simulated longitudinal traits. The 
sequence data is drawn from T2D-GENES Project 2. In 
this paper, we use the dosage genotype data on chromo- 
some 3, which include 773,088 single-nucleotide poly- 
morphisms (SNPs) that can be mapped to the genome. 
Two hundred phenotype sets were simulated based on 
genotype data. For each simulated data set, we analyze 
systolic blood pressure (SBP) and diastolic blood pressure 
(DBP), each with measurements at 3 time points, for 849 
related subjects. We map the SNPs to its host gene, 
resulting in 1426 genes. 

Inverse-probability clustering based on genotypes 

Let gii( = 0, 1, or 2 represent the number of minor alleles 
at SNP k for individual i, and let p/^ be the observed 



MAP of SNP k. We define the inverse-probability 
weighted similarity score between individuals i and / 
based on SNP k as: 
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The definition in equation (1) highlights the influence 
of rare variants, and the genotypic similarity on minor 
alleles, but not that on major alleles [4]. For a given gene 
G, the similarity between individuals / and / is defined as 
the sum of the similarity scores on SNPs within that 
gene: sim(!,j) = y^^^^^ sim (t, j; fe). For the 849 indivi- 
duals, pairwise similarity scores, sim(/, /)'s, are evaluated 
first and then converted to a distance measure using the 
transformation d (i, j) = —sim (i, j) + max (sim (i, j)), such 
that the pair with the largest similarity has distance 0. 
Other bounded monotone-decreasing transformations 
can also be applied, such as exponential transformations 
adopted in our previous work [4]. We then conduct hier- 
archical clustering based on the above distances using 
Ward's method [7], and partition individuals into groups 
by cutting the hierarchical clustering tree into a prespeci- 
fied number of groups (we consider partition sizes of 5 to 
10). Figure lA providesan example using MAP4. 

Hierarchical clustering based on longitudinal phenotypes 

The main difficulty of dealing with longitudinal traits is 
that most existing association methods only consider 
scalar phenotypes. Thus it is natural to transform longi- 
tudinal traits into some 1-dimensional summary statis- 
tics. Here we adopt ordinary hierarchical clustering 
using phenotype vectors and treat the resulting class 
labels as a summary statistic. Because hierarchical clus- 
tering uses the whole longitudinal trait as features, we 
expect that it can capture the structure contained in the 
phenotypes. In this study, we cluster the 849 individuals 
into 2 groups. Results show that these 2 groups can be 
treated as with high and low blood pressures (see Figure 
IB). Our main focus here is a strategy that turns longi- 
tudinal traits into 1-dimensional summaries. Other 
dimension-reduction techniques can also be used for 
this task. We choose to adopt hierarchical clustering for 
illustration purpose here because of its simplicity, and 
we get reasonable results (see Results). 
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(a) Clustering based on genotypes 



(b) Clustering based on 
phenotypes 
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Figure 1 Clustering of individuals using SNPs with IVIAFs between 0.01 and 0.05 for MAP4.A, Shown are 10 clusters, with the numbers 
at the top odds ratios within each partition block based on blood pressures. Each row is a SNP, and each column is an individual. SNPs 
are ordered with decreasing MAFs (from top to bottom). Green vertical bars indicate subjects with higher blood pressures (see text). Genotype 
aa is plotted in red, aA is plotted in blue, and AA is plotted in white (a denotes the minor allele). The partitions of the 849 individuals are 
indicated by dotted lines. Most partition elements are driven by similarity on rarer SNPs but not on more common SNPs. B, Clustering of 
individuals using their SBP curves from the first simulation. It can be seen that individuals are reasonably grouped into 1 high blood pressure 
cluster and 1 low blood pressure cluster 



Association analysis based on obtained clusters 

After clustering individuals based on both genotype and 
phenotype, for each gene we test the association between 
the corresponding 2 sets of partition indices. We con- 
sider one-way ANOVA and the partition-retention 
method [5,6]. The partition-retention method is based on 
an association measure / being defined as between an 
outcome variable Y and a partition 11. Specifically, 
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., where «, is the number of indivi- 
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duals in partition element i, and y, is the sample mean of 
element i. y and 5 are the sample mean and standard 
deviation of all n individuals, respectively. Here we take 
the variable that indicates which cluster an individual is 
in from longitudinal traits as Y. Intuitively, PR's / as 
defined above evaluates the amount of influence a parti- 
cular gene has on the longitudinal trait indexed by Y. 

Sequence kernel association test 

We also analyze the data using the linear SKAT [3] for 
comparison purpose. We briefly describe this method 
here. Following the same notation, a similarity score 
between individuals / and / based on SNP k can be 
defined as: sim (i, j; k) = Wkgigj, where is a weight for 
the A:*SNP. The weights (w/^s) are defined based on the 
corresponding MAFs, such that the influence of rare 
variants can be boosted, an idea morally similar to the 



similarity scores defined in equation (1). For a particular 
gene, similarity between 2 individuals can be defined by 
the same summation as in our method. 

SKAT uses the variance-component score statistic 
based on the above similarity scores to test for associa- 
tion between genotypic variants and a scalar trait. We 
treat the cluster indices from the longitudinal traits as 
the response variable in order to apply SKAT to GAW18 
data. More details on SKAT can be found in Ref. [3]. 

Results 

We first apply the proposed method to the WGS dosage 
data including all the 773,088 SNPs and the SBP trait. 
Three genes are discovered after Bonferroni correction, 
of which 1 gene, Y RNA, is significant in 15 of the 200 
replicates. It turns out that this gene resides within 
MAP4, which has the strongest signal in the simulated 
model, and produces a noncoding RNA. 

One reason for the relatively few significant genes 
obtained above may be that there is a very high density 
of variants within most genes. We then conduct similar 
analysis using only SNPs with MAFs between 0.01 and 
0.05 to increase power. SBP and DBP are regressed on 
age, sex, age x sex, and medication, and the residuals 
are used in the clustering analysis. For method compari- 
son, we treat genes containing at least 1 causal SNP in 
the simulated model as causal genes, resulting in 21 
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genes for SBP and 26 genes for DBP. We compare the 
receiver operating characteristic (ROC) curves by the pro- 
posed dual-clustering framework and SKAT (Figure 2). 
SKAT cannot get results for some of the replicates. It can 
be seen from Figure 2 that all the 3 methods have rela- 
tively low power, among which our dual-clustering 
approach with PR's / has a bigger area under curve (AUC). 
Results are similar for other partition sizes resulted from 
inverse-probability clustering. 

Discussion 

We propose a dual-clustering framework for gene-based 
association analysis with WGS and longitudinal traits. 
The first clustering is based on the inverse-probability 
weighted similarities, which automatically increase 
weights for rare variants. The similarity scores are calcu- 
lated from empirical MAF estimates. If better estimates 
are available, the proposed method can incorporate the 
better estimates to achieve improved power. The second 



clustering treats trait vectors of individuals as features, 
which accounts for the longitudinal nature of the phe- 
notypes. Individuals are then partitioned based on their 
genetic similarity on the SNPs in a gene, as well as the 
similarity of their traits. These 2 partitions are then 
used to calculate association between a gene and a long- 
itudinal trait. 

Our proposed framework is actually quite general. We 
define the similarity measure based on inverse-probabil- 
ity weighting. Other similarity measures, such as the one 
used in SKAT, can also be incorporated into our frame- 
work. Other distance-based clustering approaches can be 
adopted for the first clustering based on similarity mea- 
sures. The proposed similarities can detect variants with 
variable directions of the effects. For longitudinal traits, 
we choose hierarchical clustering because of its simpli- 
city. Hierarchical clustering does not take into account 
the correlation induced by time. Considering there are 
only 3 time points in the GAW18 data, we believe that 
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Figure 2 Average ROC curves across simulation replicates for 3 methods. Shown are results by 10 clusters using Inverse-probability 
weighting. Areas under the curve (AUCs) by different methods are compared using paired Wilcoxon signed rank tests based on the 200 
replicates, with the resulting p values shown in the table below, fpr, ie, false positive rate, is the ratio of the number of claimed causal genes and 
the number of true noncausal genes; tpr, ie, true positive rate, is the ratio of the number of claimed causal genes and the number of true causal 
genes. 
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not much information has been lost. If more time points 
are available, time-series clustering methods can be used 
(see Ref [7] for a survey on commonly used time-series 
clustering algorithms). More generally, we use clustering 
as a means of summarization, so other summarization 
strategies can also be integrated into the proposed frame- 
work. After obtaining the 2 sets of clustering indices, any 
association method can be used to measure the associa- 
tion between them. In this paper, we choose ANOVA 
and PR's /. The obtained results are similar but a little 
better than that from SKAT in terms of ROC curves (see 
Figure 2). SKAT shows superiority to more traditional 
methods in the simulation studies presented in Ref. [3]. 
Many of those traditional methods assume that causal 
variants have effects with the same direction and magni- 
tude, and do not consider the potential effects of rarer 
variants to boost power. The purpose of the current 
study is not to show the absolute superiority of our 
method, but rather to present a general framework that 
can incorporate different choices of similarities and asso- 
ciation measures, such as that from SKAT. 

Although the simulation model did not take family 
structures into account, the ANOVA p values may be 
inflated as a consequence of such structures. However, 
p values will be inflated (if any) for both causal and non- 
causal variants. Therefore, the main conclusion based on 
ROC curves is still valid. In practice, we suggest evaluating 
p values using permutations and controlling the false dis- 
covery rate in order to have better sensitivity to real 
genetic signals. This may introduce more computational 
burden, but it is worth mentioning that the 2 clustering 
tasks can be done independently and simultaneously so 
that the computational time can be reduced. Multilevel 
models with Markov chain Monte Carlo (MCMC) techni- 
ques may also address the multiple comparisons problem 
encountered here by partial pooling [8]. 

Conclusions 

The methods we experimented on have relatively low 
power on this particular data set. Our framework 
obtains slightly better results in terms of AUC. It is 
worth applying the proposed methods to other data sets 
for a comprehensive understanding of its performance. 
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